Nighttime lights are a uniquely human phenomenon. Every night we capture data on this form of human emissions with the Earth Observation System VIIRS. While we expect that this data will be able to tell us a great deal about the people living and working beneath the lights there are still many factors we need to evaluate in order to back up those assumption. In the course of this lesson we will be evaluating how the number of observations used to generate the average monthly radiance can be contributing the the month to month variability in the dataset.
It’s unlikely that a headlamp would be seen from space but we don’t know that for sure. We do know that the world is a complex place. For any automated task like the one we are working one there are numerous conditions that you probably are unaware of and couldn’t do much about even if you did know. Those assumptions, known and unknown carry through the analysis and have the potential to skew things.
At the end of last lesson we had created a correlation between yearly average radiance and poverty/age at the census tract level for three different counties in Texas. While the trends we’re not convincing, we assumed that the positive confirmation to our hypothesis is still out there and by conducting a quality control check on the input data we can pull out the truth.
Beware of the Sunk Cost Fallacy and loss aversion as it effects your research. Set your back out point before you start because there is almost always more you can do with data. The more we invest into something the less likely we are to back out.
The average monthly radiance values are delivered with an associated layer that reports the number of daily observations that were used to generate the mean monthly value for each observation location. These values can range from 0-31 depending on the month. We want to make sure that we are only looking at observation locations where we have a high degree of confidence that the values represent a true mean.
There are three reasons why this is worth evaluating
View Angle
The VIIRS sensor captures data in a 3,000km swath. This means there is a lot of area that is off nadir. As night lights are generally directional features (think lights on the side of the building) the angle at which the image is capture will effect the amount of radiance observed. The result of this is that some nightly images will be lower and some higher then the actual observed value at nadir. We can get around this by only working with on nadir passes, but those occur only every 14 days or so, that severely limits out total number of observations and would require a whole new workflow based on the daily images. Option 2 would be to ensure you have enough observation to average out that variability. How much is enough? We’ll try to find out.
lunar radiance
The moon is a large roundish rock that has influence culture and inspired the curiosity of people for as long as we’ve been around. We’ve always cared about the moon because it reflects the radiance of the sun back at the Earth. This means we can see it. Due to the orbit patterns of the three celestial body we end up with about half of each month being darker then the other half. The effects of lunar radiance are substaintial in darker areas especially when combined with some freshly fallen snow. Our counts data does not tell us anything about the date of the image captures but will just cross our figure and assume it’s normally distribution. Therefore the more observations we get, the less we need to work about the moon.
cloud cover
Clouds are like that really tall fat person wearing a fedora inside that five minutes into the concert decides to and stand right in front of you. They are in the way. Clouds are actually worse then the unsavory concert goer because they affect the night lights data in two ways; they block and defuse the nightlights.
If the clouds are dense, they will block the light entirely. The VIIRS data utilizes a thermal based cloud mask which is pretty good at capturing these dense features. We don’t get any information about night lights for those evenings in our montly averages. In a humid place like southeast Texas we can expect to have a large portion of our daily data obscured by clouds.
Low diffuse clouds and fog can diffuse the radiance and spread it out over a larger area. This means that bight areas are dimmed and darker areas can appear brighter. These clouds are close to the same temperature as the Earth surface so they are not effectively captured by the radar based could mask. It’s best to assume some of this fuzziness from the clouds will be part of the monthly averages. Same as before, the more observation the less that fuzziness will effect the average.
Answer: I don’t know. The good options I could think off all require working with daily observations at each location. That is a lot more data and a lot more work. Today we’re going to use more a generalize approach to visual multiple options. We’re not got to have a strong quantativity justification but well some observations that support or inevitable assumption. This is ok because we know it is an assumption and we can state that to others when they asses the work.
To tackle this probably we’re going to need understand exactly what the counts data is all about. So let’s get started by setting up our environment.
# load required libraries
library(sf)
library(raster)
library(dplyr)
library(tmap)
## change this to the directory where your folder is stored
baseDir <- "D:/R_SC_Spatial/intermediateGeospatialR"
Now we will pull the counts files.
#grab all counts images
images <- list.files(path = paste0(baseDir,"/data/nightLights"),
pattern = "_counts.tif",
full.names = TRUE)
images
## [1] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif"
## [2] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/august_10arc_counts.tif"
## [3] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/december_10arc_counts.tif"
## [4] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/feburary_10arc_counts.tif"
## [5] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/janurary_10arc_counts.tif"
## [6] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/july_10arc_counts.tif"
## [7] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/june_10arc_counts.tif"
## [8] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/march_10arc_counts.tif"
## [9] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/may_10arc_counts.tif"
## [10] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/november_10arc_counts.tif"
## [11] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/october_10arc_counts.tif"
## [12] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/september_10arc_counts.tif"
We have not processed these images at all so they are still as big as Texas or only as small as a 40% of Alaska, however you want to look at it.
#read in image
temp1 <- raster::raster(images[1])
temp1
## class : RasterLayer
## dimensions : 650, 798, 518700 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -106.7271, -93.42708, 25.75208, 36.58542 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif
## names : april_10arc_counts
## values : 2, 20 (min, max)
Currently there are over 500,000 observations in the image and the values range from 2 to 20. 500,000 elements if not that large in regarded to raster imagery but it’s more then need to carry around for this exploratory work. We will cut down this image to the extent of a county before going further.
# grab a radiance image
allImages <- list.files(path = paste0(baseDir,"/data/nightLights"),
pattern = ".tif",
full.names = TRUE,
recursive = TRUE)
# print to find an image from a county
allImages[1:10]
## [1] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc.tif"
## [2] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif"
## [3] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/august_10arc.tif"
## [4] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/august_10arc_counts.tif"
## [5] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/april_10arc.tif"
## [6] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/august_10arc.tif"
## [7] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/december_10arc.tif"
## [8] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/feburary_10arc.tif"
## [9] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/janurary_10arc.tif"
## [10] "D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/july_10arc.tif"
Looks like images from Bexar County starts at position 5 for me. We will read that in and use itto trim out state wide image
# read in county processed image
r1 <- raster::raster(allImages[5])
# crop the raster
temp2 <- temp1 %>%
raster::crop(r1)
# pull attributes and view
temp2
## class : RasterLayer
## dimensions : 39, 42, 1638 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -98.81041, -98.11041, 29.11875, 29.76875 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : memory
## names : april_10arc_counts
## values : 2, 11 (min, max)
qtm(temp2)
We’re down to ~1,600 observations and our the high end of the number of observations drop from 20 to 11.
The map shows us that majority of the area seems to be below 8 observations.
Also note that we cropped the counts image based on the extent of the raster data from the given county. This returns a rectangle based on the bounding box of the county raster. All this means is some of this data will be outside of our area of interest, but as we just trying to learn about the counts data that is ok.
Visualize the values
There are many different ways to summarize a data in an non spatial format. Here are a few examples
#grab the values of the raster object
vals <- raster::values(temp2)
# summary() base R
summary(vals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.192 7.000 11.000
# plot a histogram
hist(vals)
The data is pretty close to normally distributed. The mean is slightly higher then the median so there is a bit of right skew. It’s good to see that only a limited number of locations had only two cloud free observations in April, that’s the bare minimum need to calculate a mean and is unlikely to account for all those potential error sources.
Let’s narrow this down a little more and look specific in the area of the county by creating a mask object from our radiance raster
# create a mask object.
mask <- r1
# reassing all positive values to 1
mask[mask >= 0, ] <- 1
# set any value not equal to 1 as NA
mask[mask != 1, ] <- NA
There are probably other ways to make a mask object but I’ve always liked this one. As the data within the raster object is effectively a matrix, you can perform indexing to reassign values as you would a matrix. The significance of this will become more apparent as we progress, but is super flexible and fast.
# mutiple raster to apply the mask
counts <- temp2 * mask
qtm(counts)
By multiplying the images together we keep all values that are within the mask area and reassign all values outside of it as NA. This is a algebraic operation between matrices so it’s fast too.
vals2 <- raster::values(counts)
summary(vals2)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 5.000 6.000 6.181 7.000 11.000 535
hist(vals2)
Changing our area of interest did not significantly effect out summary statistics, so we are working at a scale that still captures the spatial variability of the observations.
How much of a county counts as a county
Our end goal is that we will use the counts data to limit what locations are used in the computation of the yearly averages. By doing so we’re also limiting how much of the overall area is actually represented for a particular month. We need to find a balance between what we keep and drop. We can start by seeing what proportion of the county we are reporting on if we were to filter at various levels observed in the counts data.
# pull total number of observations
vals_noNA <- vals2[!is.na(vals2)]
total <- length(vals_noNA)
# determine sequence of interest
seq1 <- seq(min(vals_noNA),max(vals_noNA), by =1 )
For each element in the sequence we will want to remove that feature from the current list and determine the change in the total elements so we can calculate a change in the area. This is a great place for a function, as we’re applying the same process for each element in the sequence.
getArea <- function(values, index){
### values : vector of numerical features
### index : numerical value to filter on
# add na clause just to be safe
values <- values[!is.na(values)]
# get total
total <- length(values)
# get new values based on index
vals_new <- values[values >= index]
# calc average
ave <- 100*(length(vals_new)/ total)
return(ave)
}
This function performs a numerical filter on the vector of values and returns a percentage which in this case represent the total number of observations. We can translate this to an area measurement using arithmetic. But conceptual keeping it as a percentage seems easier to work with.
We need to apply the function and create a mechanism to store the output. While use a dataframe for this example.
# create a dataframe to store content
df <- data.frame(matrix(nrow = length(seq1), ncol = 2))
names(df) <- c("filter", "percent area")
# assign the filter element because we have it already
df$filter <- seq1
for(i in seq_along(seq1)){
# index column position using i, but define the filter value by seq1 feature
df$`percent area`[i] <- getArea(values = vals_noNA, index = seq1[i])
}
df
## filter percent area
## 1 2 100.00000000
## 2 3 99.90933817
## 3 4 98.18676337
## 4 5 90.66183137
## 5 6 68.81233001
## 6 7 41.43245694
## 7 8 15.86582049
## 8 9 2.71985494
## 9 10 0.45330916
## 10 11 0.09066183
By building out the full data frame before hand we are keeping the operation efficient. We needed to use the seq_alongfunction to ensure we could correctly index the position to store the information. Below is another means of accomplishing this using a counter which can be handy at times, specifically when you have to deal with conditional statements within your loop.
n = 1
for(i in seq1){
# index column position using i, but define the filter value by seq1 feature
df$`percent area`[n] <- getArea(values = vals_noNA, index = i)
n = n+1
}
df
We’ve got a number
So based on this test case we can sense that dropping all locations with six observations or less still give us 90% of the county to work with. The next question is how would applying such a filter effect the amount of nighttime lights observed at the county level.
This is a really tricky questions to answer because we don’t know anything about where the locations with less then 6 observations are at this point. We’ve been conducting these test on non spatial data. We also know that there are locations in the county that are very bright and many many more that are relatively dark(urban and rural areas). So at 10% reduction we’re probably not going to see a big change in the mean and median of the all the values in the county. But there is know way of knowing without trying it out.
We can starting testing this by bringing back in the spatial data while adpating our for loop from above.
# create a dataframe to store content
df <- data.frame(matrix(nrow = length(seq1), ncol = 4))
### adding new columns for mean and median
names(df) <- c("filter", "percent area", "mean", "median")
# assign the filter element because we have it already
df$filter <- seq1
We’ve added new columns to our data frame to store the mean and median radiance values for the county. So far we’ve been working with counts only so we will need to bring in the radiance image into the loop as well.
# Check to make sure the original feature we read in matches out month of interest
r1
## class : RasterLayer
## dimensions : 39, 42, 1638 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -98.81041, -98.11041, 29.11875, 29.76875 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/Bexar/april_10arc.tif
## names : april_10arc
## values : 0.425001, 116.5439 (min, max)
temp1
## class : RasterLayer
## dimensions : 650, 798, 518700 (nrow, ncol, ncell)
## resolution : 0.01666667, 0.01666667 (x, y)
## extent : -106.7271, -93.42708, 25.75208, 36.58542 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs
## source : D:/R_SC_Spatial/intermediateGeospatialR/data/nightLights/april_10arc_counts.tif
## names : april_10arc_counts
## values : 2, 20 (min, max)
We got luckly here but will need to find a way to ensure the count and radiance image match temporally at some point. For now, let’s just figure out how to do this once. While start by drafting out the workflow.
## speculating on workflow
i <- "filter level"
## create a mask of the counts layer
counts[counts >= i, ] <- 1
counts[counts < i, ] <- NA
## apply the mask to the radiance layer
rad1 <- rad * counts
## remove all NA values
rad_vals <- raster::values(rad1)
rad_vals <- rad_vals[!is.na(rad_vals), ]
## calculate mean and median
df$mean <- mean(rad_vals)
df$median <- median(rad_vals)
Alright so we will need the clipped and masked counts raster and the monthly radiance images for each month. With those we can apply all the same methods we’ve used before to derive the mean and median radiance at the county level. As we can see the require inputs and outputs, so we can built it into a function.
radMeanAndMedian <- function(countRaster, radianceRaster, index){
## create a mask of the counts layer
countRaster[countRaster < index] <- NA
countRaster[countRaster >= index] <- 1
## apply the mask to the radiance layer
rad1 <- radianceRaster * countRaster
## remove all NA values
rad_vals <- raster::values(rad1)
rad_vals <- rad_vals[!is.na(rad_vals)]
## create a vector to store outputs
values <- c()
## calculate mean and median
values[1] <- mean(rad_vals)
values[2] <- median(rad_vals)
return(values)
}
We had to change where we are storing the mean and median values. We could return two objects, but I think it’s better to output a single feature and use indexing to access the data. With this new function in hand let’s adjust our existing workflow to fit around it.
# define input parameters
count_rastula <- counts
rad_rast <- raster::raster(allImages[5])
# determine sequence of filters
count_vals <- raster::values(count_rastula)
vals_noNA <- count_vals[!is.na(count_vals)]
seq1 <-seq(min(vals_noNA), max(vals_noNA), by = 1)
# loop over filter values
for(i in seq_along(seq1)){
# run the area function
df$`percent area`[i] <- getArea(values = vals_noNA, index = seq1[i])
# run the mean median function
meanMedian <- radMeanAndMedian(countRaster = count_rastula,
radianceRaster = rad_rast,
index = seq1[i])
# a vector is return with mean and median values index to assigning it to the correct positions
df[i,3:4] <- meanMedian
}
df
## filter percent area mean median
## 1 2 100.00000000 11.679998 4.518006
## 2 3 99.90933817 11.669617 4.515014
## 3 4 98.18676337 11.789309 4.588748
## 4 5 90.66183137 12.113821 4.980951
## 5 6 68.81233001 11.341239 4.686239
## 6 7 41.43245694 11.882842 5.769488
## 7 8 15.86582049 13.084229 6.895073
## 8 9 2.71985494 18.566201 13.547457
## 9 10 0.45330916 19.168742 3.585171
## 10 11 0.09066183 3.585171 3.585171
This looks great and we can see some more sigificant changes occuring between filter lever 6 and 7. Let’s take a second to visulize these results in a graphic
We’re going to be using a new library here called Plotly. What makes plotly stand out relative to ggplot2 is it’s ability to create interactive figures. This becomes particularly valuable when you’re utilizing rmd documents to generate report of your results.
# install and load package
# install.packages("plotly")
library(plotly)
### Plot a figure
p1 <- plot_ly()
p1
Plotly functions utilize the dplyr piping structure rather then the “+” operator like ggplot2. Both allow you to add to existing objects. This means we can start with a blank figure
p2 <- p1 %>%
add_trace(x=df$filter, y=df$`percent area`,type = 'scatter')
p2
The add trace function allows us to add elements to the figure. In this case we plot filter levels on the x and percent area on the y with the type defined as scatter.
While this is discrete data and points are the correct means of displaying it we can add a line to this feature to visualize the trend more clearly.
p3 <- p2%>%
add_trace(x=df$filter, y=df$`percent area`,type = 'scatter', line = list(dash = 'dash', shape= "spline"))
p3
Notice that a legend was added now that there are two features. As the two features are the same we simply don’t see one of them. The order in which we add features to the plotly object determines the visual hierarchy.
We don’t need two sets of the same data on the plot so let recreate this from the start and add some more text to describe the legends
p1 <- plot_ly() %>%
add_trace(x=df$filter, y=df$`percent area`,type = 'scatter', line = list(dash = 'dash', shape= "spline"))%>%
layout(xaxis = list(title = "Filter Level"),
yaxis = list(title = "Percentage of Coverage"))
p1
Ok that looks good, but we still have two other parameters to visualize. We can try to add these to the same figure or just create two more figures. We will go with the second option for now, using effectively the same structure.
Note we can add the parameters from the add trace call into the original plot_ly function. This is more of the standard. add_trace is generally use to add multiple elements to a plot. The end result is the same.
# mean plot
p2 <- plot_ly(x=df$filter, y=df$mean,type = 'scatter', line = list(dash = 'dash', shape= "spline")) %>%
layout(xaxis = list(title = "Filter Level"),
yaxis = list(title = "Mean"))
# median plot
p3 <- plot_ly() %>%
add_trace(x=df$filter, y=df$median,type = 'scatter', line = list(dash = 'dash', shape= "spline"))%>%
layout(xaxis = list(title = "Filter Level"),
yaxis = list(title = "Median"))
p2
p3
Ok, we’ve got three plots which is great. We can connect them to each other more directly by calling on another plotly function subplot
p<- plotly::subplot(p1,p2,p3, nrows = 3, shareX = TRUE, titleY = TRUE)
p
Due to the shape of the data I choose to stack the plots by calling nrows = 3. This works particularly well because all plots share the same x-axis making for a nice compact plot. The titleY parameter carries the titles from the original plots through to the final figure.
With this visualization we start to see that even though we loose a lot of area at filter level six, the mean and median for the county remain consistent. This means that we are still capturing the general quality of the nightlight at the county level spatial scale. 68% of the county could be a fair sample size for the county as a whole.
It this point we have a process developed for generating a data rich visualization that shows how filtering the radiance data based on the number of observations changes the average radiance of the county. This product is best suited for aiding a discussion around quantity and quality of observations. So far we’ve made it work once, on one month, for one county. If we wanted to be comprehensive in our assessment we would need to apply this process across;
We could structure this out within a series of loops but evaluating the result at the county level is probably more appropriate. It’s a concrete spatial scale that much of our current analysis is built on. As the results we are interest in showing gain utility from interactivity we can produce them via a rmd to html rather then a R script. An added bonus of the rmd process is that we can call the document directly from a R script in a similar way we call in a function. This is a lot to try to visualize so lets just go for it and discuss the details as we go along.